Unsupervised Speech Segmentation and Variable Rate Representation Learning Using Segmental Contrastive Predictive Coding

Abstract

Typically, unsupervised segmentation of speech into phone- and word-like units is treated as two separate tasks, often handled by different methods that do not fully leverage the inter-dependence of the two tasks. Here, we unify them and propose a technique that can jointly perform both, showing that these two tasks indeed benefit from each other. Recent attempts employ self-supervised learning, such as contrastive predictive coding (CPC), where the next frame is predicted given past context. However, CPC only looks at the frame-level structure of the audio signal. We overcome this limitation with a segmental contrastive predictive coding (SCPC) framework that models the signal structure at a higher level, e.g., the phone level. A convolutional neural network learns frame-level representations from the raw waveform via noise-contrastive estimation (NCE). A differentiable boundary detector finds variable-length segments, which are then used to optimize a segment encoder via NCE to learn segment representations. The differentiable boundary detector allows us to train the frame-level and segment-level encoders jointly. Experiments show that our single model outperforms existing phone and word segmentation methods on the TIMIT and Buckeye datasets. We analyze the impact of the boundary threshold on performance, and the results suggest that automatically learning the threshold can be as effective as manually tuning it. We discover that phone class impacts boundary detection performance, and that boundaries between successive vowels or semivowels are the most difficult to detect. Finally, we use SCPC to extract speech features at the segment level rather than at uniformly spaced frame intervals (e.g., 10 ms), producing variable rate representations that change according to the contents of the utterance. We can lower the feature extraction rate from the typical 100 Hz to as low as 14.5 Hz on average while still outperforming hand-crafted features such as MFCC on a linear phone classification task.
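The segment-level pipeline described above can be illustrated with a minimal pure-Python sketch. It is not the authors' implementation, and every function name in it (`cosine_dissimilarity`, `detect_boundaries`, `pool_segments`, `segment_rate_hz`, `info_nce`) is hypothetical. It assumes cosine dissimilarity between adjacent frame representations, a fixed threshold with hard local-peak picking (unlike the paper's detector, this one is not differentiable and is not trained), mean pooling of the frames inside each segment, and a plain dot-product score inside the InfoNCE loss.

```python
import math
from typing import List, Sequence

Vector = List[float]


def cosine_dissimilarity(frames: Sequence[Vector]) -> List[float]:
    """1 - cosine similarity for each pair of adjacent frames.

    Entry t compares frame t with frame t + 1; a large value hints that a
    boundary falls between them.
    """
    out = []
    for a, b in zip(frames[:-1], frames[1:]):
        dot = sum(x * y for x, y in zip(a, b))
        norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
        out.append(1.0 - dot / (norm + 1e-8))  # epsilon guards against zero vectors
    return out


def detect_boundaries(dissim: Sequence[float], threshold: float) -> List[int]:
    """Indices t whose dissimilarity is a local peak above `threshold`.

    Hard, non-differentiable peak picking (an assumption of this sketch).
    A boundary at t means a new segment starts at frame t + 1.
    """
    boundaries = []
    for t, d in enumerate(dissim):
        left = dissim[t - 1] if t > 0 else -math.inf
        right = dissim[t + 1] if t + 1 < len(dissim) else -math.inf
        # Strict on the left, non-strict on the right: one boundary per flat peak.
        if d > threshold and d > left and d >= right:
            boundaries.append(t)
    return boundaries


def pool_segments(frames: Sequence[Vector], boundaries: Sequence[int]) -> List[Vector]:
    """Mean-pool the frames of each segment into one vector per segment."""
    starts = [0] + [t + 1 for t in boundaries]
    ends = [t + 1 for t in boundaries] + [len(frames)]
    segments = []
    for s, e in zip(starts, ends):
        chunk = frames[s:e]
        dim = len(chunk[0])
        segments.append([sum(f[i] for f in chunk) / len(chunk) for i in range(dim)])
    return segments


def segment_rate_hz(num_segments: int, num_frames: int, frame_rate_hz: float = 100.0) -> float:
    """Segments per second of audio, given the frame rate (100 Hz = 10 ms frames)."""
    return num_segments * frame_rate_hz / num_frames


def info_nce(pred: Vector, positive: Vector, negatives: Sequence[Vector]) -> float:
    """InfoNCE loss: -log softmax probability of `positive` among all candidates.

    Scores are plain dot products with the prediction (an assumption; CPC-style
    models typically score through a learned linear projection).
    """
    candidates = [positive] + list(negatives)
    scores = [sum(p * c for p, c in zip(pred, cand)) for cand in candidates]
    m = max(scores)  # subtract the max for a numerically stable log-sum-exp
    log_sum_exp = m + math.log(sum(math.exp(s - m) for s in scores))
    return log_sum_exp - scores[0]


if __name__ == "__main__":
    # Toy input: 30 frames at 100 Hz (0.3 s) forming three constant "units".
    frames = [[1.0, 0.0, 0.0]] * 10 + [[0.0, 1.0, 0.0]] * 10 + [[0.0, 0.0, 1.0]] * 10
    bounds = detect_boundaries(cosine_dissimilarity(frames), threshold=0.5)
    segments = pool_segments(frames, bounds)
    print(bounds)  # [9, 19]
    print(segment_rate_hz(len(segments), len(frames)))  # 10.0
```

In the toy run, boundaries fall at indices 9 and 19, so 30 frames (0.3 s at 100 Hz) collapse into three segment vectors, a rate of 10 Hz instead of 100 Hz. Averaged over many utterances, a rate computed this way is the kind of figure behind the abstract's 14.5 Hz average.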

Similar resources

Techniques for Variable Rate Speech Coding Using

This paper presents two techniques for variable rate speech coding using wavelet representations. In the first method, the Daubechies wavelet is used to produce a wavelet representation of the speech signal. This representation is encoded using some of the properties of the speech signal to allocate bits to the various subbands. A technique to lower the bit rate using vector quantisation is discu...

Variable Rate Speech Coding Using Discrete Time Wavelet Extrema

We describe a novel procedure for variable rate encoding of a speech signal starting from the discrete time wavelet extrema representation. We describe the bit reduction achievable by thresholding the extrema signal. We also demonstrate that the thresholding procedure provides a "denoising" effect. We then reduce the bit rate further by using a bit allocation scheme that adopts a model for the e...

Unsupervised Texture Image Segmentation Using MRFEM Framework

Texture image analysis is one of the most important areas of image processing in medical sciences and industry. To date, different approaches have been proposed for segmenting texture images. In this paper, we propose unsupervised texture image segmentation based on the Markov Random Field (MRF) model. First, we used Gabor filter with different parameters’ (frequency, orientatio...

Journal

Journal title: IEEE/ACM Transactions on Audio, Speech, and Language Processing

Year: 2022

ISSN: 2329-9304, 2329-9290

DOI: https://doi.org/10.1109/taslp.2022.3180684